Using the IPython Notebook for Reproducible Parallel Computing



In [1]:

    
from IPython.display import display, Image, HTML
from talktools import website, nbviewer

SIAM Conference on Parallel Processing for Scientific Computing (PP 2014)

Brian E. Granger (@ellisonbg)

Physics Professor, Cal Poly

Core developer, IPython Project

The IPython Project

IPython is an open source, interactive computing environment for Python and other languages.



In [2]:

    
website('http://ipython.org')









    Out[2]:

Started in 2001 by Fernando Perez, who continues to lead the project from UC Berkeley
Open source, BSD license
Started as an enhanced interactive Python shell:

Today, IPython is a powerful architecture for interactive code execution:
- Language independent message specification for running code in and getting results from remote processes (JSON over WebSockets and ZeroMQ).
- IPython Frontends: web-based notebook, Qt Console, terminal console
- IPython.parallel: interactive parallel computing
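
To make the first bullet concrete, here is a rough sketch of what one such message might look like when built in Python. The four top-level keys (header, parent_header, metadata, content) follow the published messaging documentation as far as I know; the helper name make_execute_request and the specific header values are assumptions for illustration only. A real client also signs and frames the message before it goes over ZeroMQ, which is not shown here.

```python
import json
import uuid


def make_execute_request(code, session, username="user"):
    """Build a dict shaped like an execute_request message (illustrative only)."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,   # unique id for this message
            "session": session,           # id shared by all messages of one client
            "username": username,
            "msg_type": "execute_request",
        },
        "parent_header": {},              # empty: this message starts a new exchange
        "metadata": {},
        "content": {
            "code": code,                 # source code the kernel should run
            "silent": False,
        },
    }


msg = make_execute_request("1 + 1", session=uuid.uuid4().hex)
# Everything is plain JSON, so any language with a JSON library can take part.
print(json.dumps(msg, indent=2))
```

Because the payload is plain JSON, neither the frontend nor the kernel has to be written in Python.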

See the following talk on IPython.parallel by Min Ragan-Kelley

Funding

Over the past 13 years, much of IPython has been "funded" by volunteer developer time.
Past funding: NASA, DOD, NIH, Enthought Corporation
Current funding:

Development team

IPython is developed by a talented team of $\approx15$ core developers and a larger community of $\approx100$ contributors.
Through the above funding sources, there are currently 6 full-time people working on IPython at UC Berkeley and Cal Poly.



In [3]:

    
import ipythonproject



In [4]:

    
ipythonproject.core_devs()









    




Fernando Perez Brian Granger Min Ragan-Kelley Thomas Kluyver
Matthias Bussonnier Jonathan Frederic Paul Ivanov Evan Patterson
Damian Avila Brad Froehle Zach Sailer Robert Kern
Jorgen Stenarson Jonathan March Kyle Kelley

The IPython Notebook

The IPython Notebook is a web-based interactive computing environment that spans the full range of computing-related activities:

Individual exploration, analysis and visualization
Debugging, testing
Production runs
Parallel computing
Collaboration
Publication
Presentation
Teaching/Learning

How does IPython target these different activities?

Interactive exploration

The central focus of IPython is the writing and running of code. We try to make this as pleasant as possible:

Multiline editing
Tab completion
Integrated help
Syntax highlighting
System shell access

Let's use NumPy and Matplotlib to look at the eigenvalue spacing distribution of random matrices:



In [5]:

    
%matplotlib inline



In [6]:

    
import matplotlib.pyplot as plt
import seaborn
import numpy as np
ra = np.random
la = np.linalg



In [7]:

    
def GOE(N):
    """Creates an NxN element of the Gaussian Orthogonal Ensemble"""
    m = ra.standard_normal((N,N))
    m += m.T
    return m/2

def center_eigenvalue_diff(mat):
    """Compute the eigvals of mat and then find the center eigval difference."""
    N = len(mat)
    evals = np.sort(la.eigvals(mat))
    diff = np.abs(evals[N//2] - evals[N//2-1])  # integer division keeps the indices ints
    return diff

def ensemble_diffs(num, N):
    """Return num eigenvalue diffs for the NxN GOE ensemble."""
    diffs = np.empty(num)
    for i in range(num):
        mat = GOE(N)
        diffs[i] = center_eigenvalue_diff(mat)
    return diffs/diffs.mean()



In [8]:

    
diffs = ensemble_diffs(1000,30)



In [9]:

    
plt.hist(diffs, bins=30, normed=True)
plt.xlabel('Normalized eigenvalue spacing s')
plt.ylabel('Probability $P(s)$')









    Out[9]:





<matplotlib.text.Text at 0x10b6c2ad0>
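
A natural follow-up is to compare this histogram against the Wigner surmise, $P(s) = \frac{\pi}{2}\, s\, e^{-\pi s^2/4}$, the standard approximation for GOE nearest-neighbor spacings. The talk does not include this comparison, so the sketch below is an addition: it only defines the curve and checks numerically that it has unit area and unit mean, which matches the normalization produced by diffs/diffs.mean(). The helper names wigner_surmise and trapezoid are hypothetical.

```python
import math


def wigner_surmise(s):
    """Wigner surmise for the GOE: P(s) = (pi/2) * s * exp(-pi * s**2 / 4)."""
    return (math.pi / 2.0) * s * math.exp(-math.pi * s * s / 4.0)


def trapezoid(f, a, b, n=20000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


# The tail beyond s = 10 is negligible, so [0, 10] stands in for [0, infinity).
area = trapezoid(wigner_surmise, 0.0, 10.0)
mean = trapezoid(lambda s: s * wigner_surmise(s), 0.0, 10.0)
print("area = %.6f, mean spacing = %.6f" % (area, mean))
```

The surmise is exact only for 2x2 matrices, so agreement with the 30x30 histogram above should be expected to be close rather than perfect.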

Common shell commands (ls, cd) just work:



In [10]:

    
ls









    



LICENSE             data/               ipythonproject.pyc  load_style.pyc      talktools.py
README.md           images/             ipythonteam/        lorenz.py           talktools.pyc
SIAM Talk.ipynb     ipythonproject.py   load_style.py       talk.css

Manage small files in the notebook using the %%writefile magic command:



In [11]:

    
%%writefile data/mydata.csv
0 1 2 3 4 5 6 7 8 9 10









    



Overwriting data/mydata.csv

Any command prefixed with ! is run in the system shell:



In [12]:

    
!cat data/mydata.csv









    



0 1 2 3 4 5 6 7 8 9 10

What does this have to do with parallel computing?

The canonical user interface to clusters and supercomputers is a terminal session over SSH. Ouch. This is extremely painful for the user and makes it almost impossible to reproduce the workflow. Here is a simple recipe for making parallel computing reproducible and literate:

Install and run the IPython Notebook on the head node
Write notebooks that create input files, submit jobs, perform post processing, visualization
Provide inline narrative descriptions of the workflow
Store the notebooks in a version control system (git, svn, etc.)
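
The recipe above can be sketched in a single notebook cell. This is a minimal illustration, not a description of any particular cluster: the run_job helper and the input.json file name are hypothetical, and a short Python subprocess stands in for whatever submit command your scheduler provides.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_job(params, workdir):
    """Write an input file, run a job that reads it, and return the job's output."""
    input_path = Path(workdir) / "input.json"
    # Step 1: create the input file from values recorded in the notebook.
    input_path.write_text(json.dumps(params, sort_keys=True))
    # Step 2: launch the job. This one-liner is a stand-in for a real
    # scheduler submit command; it just sums the values in the input file.
    job_code = (
        "import json, sys; "
        "p = json.load(open(sys.argv[1])); "
        "print(sum(p['values']))"
    )
    result = subprocess.run(
        [sys.executable, "-c", job_code, str(input_path)],
        capture_output=True, text=True, check=True,
    )
    # Step 3: post-process the raw output so it can be plotted or tabulated.
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    output = run_job({"values": [1, 2, 3]}, tmp)
print("job output:", output)
```

Since the parameters, the command and the post-processing all live in the notebook, committing the notebook captures the whole workflow.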

Multiple backend languages

Scientific computing is a multi-language activity: Python, C, C++, Fortran, Perl, Bash, etc. The IPython architecture is language agnostic.

For statistical computing, R is a great option. Let's fit a linear model in R and visualize the results:



In [13]:

    
import numpy as np
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
%load_ext rmagic

The %%R syntax tells IPython to run the rest of the cell as R code:



In [14]:

    
%%R -i X,Y -o XYcoef
XYlm = lm(Y~X)
XYcoef = coef(XYlm)
print(summary(XYlm))
par(mfrow=c(2,2))
plot(XYlm)









    





Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:   0.81,	Adjusted R-squared:  0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739
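
As a cross-check on the R output, the same intercept, slope and R-squared follow from the closed-form least squares formulas. The sketch below is an addition to the talk and uses plain Python on the same X and Y values; the function names linear_fit and r_squared are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


X = [0, 1, 2, 3, 4]
Y = [3, 5, 4, 6, 7]
intercept, slope = linear_fit(X, Y)
r2 = r_squared(X, Y, intercept, slope)
print(round(intercept, 4), round(slope, 4), round(r2, 4))
```

Up to rounding, the printed values should agree with the Estimate column and the Multiple R-squared line in the summary above.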

This %%language syntax is an IPython-specific extension to the Python language. This "magic command syntax" allows Python code to call out to a wide range of other languages (Ruby, Bash, Julia, Fortran, Perl, Octave, Matlab, etc.).



In [15]:

    
%%ruby
puts "Hello from Ruby #{RUBY_VERSION}"









    



Hello from Ruby 2.0.0



In [16]:

    
%%bash
echo "hello from $BASH"









    



hello from /bin/bash
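
Conceptually, script-style magics such as %%ruby and %%bash hand the cell body to an external interpreter and capture what it prints. The sketch below mimics that idea with the standard library; it is not IPython's implementation, and run_cell_with is a hypothetical helper. The Python interpreter is used as the external program so the example does not depend on Ruby or Bash being installed.

```python
import subprocess
import sys


def run_cell_with(program, cell):
    """Feed the text of a cell to an external program on stdin; return its stdout."""
    result = subprocess.run(
        program, input=cell, capture_output=True, text=True, check=True
    )
    return result.stdout


cell = 'print("hello from a subprocess")\n'
# "-" tells the Python interpreter to read the program from standard input.
output = run_cell_with([sys.executable, "-"], cell)
print(output, end="")
```

Swapping the program list for ["bash"] or ["ruby"] would run the cell text in those interpreters instead, assuming they are on the PATH.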

Native kernels

In the IPython architecture, the kernel is a separate process that runs the user's code and returns the output to the frontend (Notebook, Terminal, etc.). Kernels talk to frontends using a well-documented message protocol (JSON over ZeroMQ and WebSockets). The default kernel that ships with IPython runs Python code. However, there are now kernels for other languages:

Julia (https://github.com/JuliaLang/IJulia.jl)
Ruby (https://github.com/minrk/iruby)
Haskell (https://github.com/gibiansky/IHaskell)
Scala (https://github.com/Bridgewater/scala-notebook)
node.js (https://gist.github.com/Carreau/4279371)
Go (https://github.com/takluyver/igo)
R and Matlab are in the works

Later this year, all users of the IPython Notebook will have the option to choose what type of kernel to use for each Notebook.

Here is a notebook that runs code in the native Julia kernel:



In [17]:

    
website("http://nbviewer.ipython.org/url/jdj.mit.edu/~stevenj/IJulia%20Preview.ipynb")









    Out[17]:

Notebook documents

Notebook documents are just JSON files stored on your filesystem. These files store everything related to a computation:

Code
Output (text, HTML, plots, images, JavaScript)
Narrative text (Markdown with embedded LaTeX math)
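
Because a notebook is plain JSON, it can be read and written with nothing more than a JSON library. The sketch below builds a tiny two-cell document by hand. The layout is an assumption: it follows the nbformat 4 structure, which postdates this talk (the format in use at the time nested cells inside worksheets), and code_sources is a hypothetical helper.

```python
import json

# A hand-written, minimal notebook: one Markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": "Some *narrative* text with math: $E = mc^2$",
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": "print(1 + 1)",
            # Outputs are stored next to the code that produced them.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": "2\n"}
            ],
        },
    ],
}


def code_sources(nb):
    """Return the source of every code cell in a notebook dict."""
    return [cell["source"] for cell in nb["cells"] if cell["cell_type"] == "code"]


text = json.dumps(notebook, indent=1)   # what would be written to the .ipynb file
restored = json.loads(text)             # what a tool reading the file would see
print(code_sources(restored))
```

Storing code, output and narrative in one text file is also what makes notebooks work reasonably well with version control.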

Notebook documents can be shared:

GitHub repos
Email
Dropbox
Internal shared file systems

Notebook documents can be viewed by anyone on the web through http://nbviewer.ipython.org



In [18]:

    
website("http://nbviewer.ipython.org")









    Out[18]:

This allows people to compose and share reproducible stories that involve code and data.

Earlier this year, Randall Munroe (xkcd) published a comic about regular expression golf. Peter Norvig from Google wanted to explore some of the algorithms related to this comic and shared his explorations as a notebook on nbviewer:



In [20]:

    
website("http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb")









    Out[20]:

Rich output

IPython has a display system for rich output formats. This rich display system allows Python objects to declare non-textual representations that can be displayed in the Notebook. These rich representations include:

PNG/JPEG
HTML
JavaScript
LaTeX
SVG
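
An object declares a rich representation by defining a specially named method such as _repr_html_; the Notebook calls it when the object is displayed and falls back to the plain repr elsewhere. The Temperature class below is a made-up example used only to show the hook, and it runs without IPython because the method is called directly.

```python
class Temperature:
    """Toy object with both a plain-text and an HTML representation."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __repr__(self):
        # Used by terminals and anything that cannot render HTML.
        return "Temperature(%.1f C)" % self.celsius

    def _repr_html_(self):
        # Picked up by the Notebook's display system when available.
        color = "red" if self.celsius > 30 else "blue"
        return '<b style="color: %s">%.1f &deg;C</b>' % (color, self.celsius)


t = Temperature(35.0)
print(repr(t))
print(t._repr_html_())
```

In a Notebook, leaving t as the last expression of a cell, or calling display(t), would render the bold colored text instead of the plain string.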

These rich representations are displayed using IPython's display function:



In [21]:

    
from IPython.display import HTML, Image, YouTubeVideo, Audio, Latex

Here is an Image object whose representation is an image:



In [22]:

    
i = Image('images/ipython_logo.png')



In [23]:

    
display(i)

The Audio object has a representation that is an HTML5 audio player:



In [24]:

    
a = Audio('data/Bach Cello Suite #3.wav')



In [25]:

    
display(a)

The Latex object produces a representation that is rendered LaTeX. In this case, Maxwell's equations:



In [26]:

    
Latex(r"""\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0 
\end{eqnarray}""")









    Out[26]:





\begin{eqnarray}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0 
\end{eqnarray}

The YouTubeVideo object embeds the HTML for a YouTube video in the notebook:



In [27]:

    
YouTubeVideo('sjfsUzECqK0')









    Out[27]:

Interacting with data

Data exploration is an iterative process that involves repeated passes at visualization, interaction and computation:



In [28]:

    
Image('images/VizInteractCompute.png')









    Out[28]:

Right now this cycle is still really painful:

It takes too long to go through a single iteration
Even when we are successful, the overall process is not reproducible
Difficult to repeat, generalize or share with others
Massive cognitive load that has nothing to do with extracting insight from the data

For IPython 2.0 we have built an architecture that allows Python and JavaScript to communicate seamlessly and in real time. This allows users to easily iterate through this cycle.

Image editing

In this example, we will perform some basic image processing using scikit-image.



In [29]:

    
from IPython.html.widgets import *



In [30]:

    
import skimage
from skimage import data, filter, io



In [33]:

    
i = data.coffee()
io.Image(i)









    Out[33]:



In [34]:

    
def edit_image(image, sigma=0.1, r=1.0, g=1.0, b=1.0):
    new_image = filter.gaussian_filter(image, sigma=sigma, multichannel=True)
    new_image[:,:,0] = r*new_image[:,:,0]
    new_image[:,:,1] = g*new_image[:,:,1]
    new_image[:,:,2] = b*new_image[:,:,2]
    new_image = io.Image(new_image)
    display(new_image)
    return new_image

Calling the function once displays and returns the edited image:



In [35]:

    
new_i = edit_image(i, 0.5, r=0.5);



In [36]:

    
lims = (0.0,1.0,0.01)
interact(edit_image, image=fixed(i), sigma=(0.0,10.0,0.1), r=lims, g=lims, b=lims);

Lorenz system

Let's explore the Lorenz system of differential equations:

$$ \begin{aligned} \dot{x} & = \sigma(y-x) \\ \dot{y} & = \rho x - y - xz \\ \dot{z} & = -\beta z + xy \end{aligned} $$

This is one of the classic systems in non-linear differential equations. It exhibits a range of different behaviors as the parameters ($\sigma$, $\beta$, $\rho$) are varied.
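
For readers who want to see the equations as code, here is a minimal sketch of the right-hand side together with a fixed-step fourth-order Runge-Kutta integrator. This is not the solve_lorenz function from lorenz.py used in this talk, which relies on SciPy and is not shown; the names lorenz_rhs, rk4_step and integrate are hypothetical, and the default parameters are the commonly used values.

```python
import math


def lorenz_rhs(state, sigma=10.0, beta=8.0 / 3.0, rho=28.0):
    """Time derivatives (dx/dt, dy/dt, dz/dt) of the Lorenz system."""
    x, y, z = state
    return (sigma * (y - x), rho * x - y - x * z, -beta * z + x * y)


def rk4_step(f, state, dt):
    """Advance state by one classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(f, state, dt, steps):
    """Return the list of states visited, including the initial one."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(f, state, dt)
        trajectory.append(state)
    return trajectory


trajectory = integrate(lorenz_rhs, (1.0, 1.0, 1.0), dt=0.01, steps=400)
print("final state:", trajectory[-1])

# Sanity check: the derivatives vanish at the non-trivial fixed point
# x = y = sqrt(beta * (rho - 1)), z = rho - 1.
c = math.sqrt((8.0 / 3.0) * 27.0)
print("rhs at fixed point:", lorenz_rhs((c, c, 27.0)))
```

A fixed step size keeps the sketch short; an adaptive solver is the better choice for long integrations of a chaotic system.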



In [37]:

    
from IPython.html.widgets import interact, fixed
from IPython.display import clear_output, display, HTML

Here is a Python function that solves the Lorenz system using SciPy and plots the results using matplotlib:



In [38]:

    
from lorenz import solve_lorenz



In [39]:

    
t, x_t = solve_lorenz(N=10, angle=0.0, max_time=4.0, sigma=10.0, beta=8./3, rho=28.0)

Let's use interact to explore this function:



In [40]:

    
interact(solve_lorenz, angle=(0.,360.), N=(0,50), sigma=(0.0,50.0),
         rho=(0.0,50.0), beta=fixed(8./3));

Conclusion

The IPython Notebook enables users to tell reproducible stories that involve code and data

The scripts and command line programs used in the traditional parallel computing workflow can all be managed and run from within the Notebook

The Python ecosystem provides a rich foundation for data analysis, visualization, algorithm development, parallel development



In [41]:

    
%load_ext load_style



In [ ]:

    
%load_style talk.css


